Moving beyond prototypes 
Building resilience at scale 
in your IoT application

Alina Dima
Senior Developer Advocate, AWS IoT
Global service team.


Love problem solving.
Close to 20 years of engineering experience, designing and building
Serverless and IoT solutions at scale.


Goals of the session

* Understand some of the challenges that lead to data loss, data inaccuracy, or data delays in IoT applications.
* Identify the tools to mitigate the challenges.
* Build a maturity model.
* Help you make informed decisions on how to mitigate the challenges.
* We will discuss resilience as a crucial factor for data consistency in IoT applications.
* This session is not a comparison between tools.
* It is not a comparison between AWS services.

Introduction to IoT and MQTT

What is IoT?

* Billions of connected devices, generating data and actuating.
* Data can become an organizational asset. But only if it is reliable.
* It is not possible to have reliable data with a non-resilient IoT application.

Introduction to MQTT


[Diagram: publishers and subscribers connected through a broker.]

* MQTT is an OASIS standard messaging protocol for the Internet of Things (IoT).
* Application layer, typically over TCP/IP.
* Bi-directional communication.
* Decoupled producers and consumers.
* Catered for unreliable connectivity scenarios.
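
To make "decoupled producers and consumers" concrete, here is a minimal in-memory sketch of the publish/subscribe pattern. It is not an MQTT implementation: the `TinyBroker` class and the elevator topic names are hypothetical, and it matches topics exactly, without MQTT wildcards. It only illustrates that publishers and subscribers know the broker and the topic, never each other.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, bytes], None]


class TinyBroker:
    """Toy in-process broker (hypothetical); exact topic match only."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every message on this topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload to every subscriber; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = TinyBroker()
received = []
# The subscriber registers interest in a topic, not in a specific publisher.
broker.subscribe("elevators/42/door", lambda t, p: received.append((t, p)))
# The publisher sends to the topic without knowing who is listening.
broker.publish("elevators/42/door", b"open")
```

Publishing to a topic nobody subscribed to simply delivers to zero handlers, which is why retained messages and persistent sessions matter later in this session.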


AWS IoT

[Diagram: IoT things connect to AWS IoT Core in the AWS Cloud, alongside AWS Lambda, Amazon DynamoDB, and other AWS/Amazon services.]


Challenges you might face when building IoT

* Complexity and constraints.
* Getting started and moving past prototypes.
* Problems are hard to identify early, let alone fix.
* Shared responsibility model - IoT is a means to an end.
  * Not just Cloud versus your Application.
  * But also between your different internal teams.


Why do we care about resilience in IoT applications?


Let's look at a real world scenario

* Monitor elevator activity.
* Read and understand elevator data:
  * Door openings/closings and speeds.
  * Speed of movement.
  * How long does a run take?
* Predict failures and auto-schedule maintenance windows.

BUT, we have a problem with the existing solution...
* In production, the solution does not meet the requirements:
  * Devices go offline and stay offline, or fluctuate between offline and online.
  * Data is missing: real time events and historical data gaps.
  * Data has been flagged as unreliable by the analytics team.
  * Engineers across different teams cannot agree where the problem lies.


High Level Architecture

[Diagram: MQTT Client, AWS IoT Core, IoT data topic(s), IoT rule(s), Amazon OpenSearch, Amazon S3.]

What can go wrong?

* Connection lost or unstable.
  * Cascades to unpredictable edge behaviour.
  * Cascades to messages not sent.
* Ingestion fails.
* Writes fail.
* What about data consistency and completeness?
* What about scale?

How do we solve reliability issues?

By building resilience!

Resilient MQTT Connection


MQTT Protocol

* Keep Alive / Heartbeat.
* Client Takeover.
* Last Will and Testament (LWT).
* Persistent Sessions.

Application Level

* Understand your MQTT client library.
* Manage the MQTT connection lifecycle.
* Listen for/handle MQTT events: connects, interrupts, resumes, etc.
* Connection state checks.
* Track and recover from connection errors (client and server-side).
* Automatic reconnects, with jitter and/or exponential backoff strategy.
* Design devices to have an accurate time.
* Use tools that allow you to test your MQTT implementation, like AWS IoT Device Advisor.
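
As an illustration of "automatic reconnects, with jitter and/or exponential backoff", the sketch below computes a randomized delay that grows with each failed attempt. It assumes a hypothetical `connect` callable that returns `True` on success; real MQTT client libraries often ship their own reconnect settings, so check those first. The "full jitter" variant used here is one common choice, not the only one.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a full-jitter delay in seconds for a 0-based retry attempt.

    The ceiling doubles with every attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling] so a fleet does not reconnect in lockstep.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def reconnect(connect: Callable[[], bool], max_attempts: int = 8,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `connect` until it succeeds or attempts run out.

    `connect` is a hypothetical callable wrapping your client library's connect.
    """
    for attempt in range(max_attempts):
        if connect():
            return True
        sleep(backoff_delay(attempt))
    return False
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time.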


Resilient Message Delivery

MQTT Protocol

* Quality of Service - QoS 1 for reliable message transmission.
* Message Queueing.
* Retained Messages.
* Persistent Sessions.

Application Level

* Encapsulate the MQTT transport layer.
* MQTT message buffering for short time connection loss.
* Track success/failure of message delivery - PUBACKs.
* Mitigate failed delivery.
* Offline data storage strategy.
* Optimize data sent from devices to backend services.
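
A minimal sketch of message buffering with PUBACK tracking could look like the following. The `OutboundBuffer` class is hypothetical, and its packet IDs are simple local counters rather than the identifiers a real MQTT client library assigns. It keeps messages in memory only; a real offline storage strategy would also persist them to disk.

```python
from collections import OrderedDict, deque
from typing import Deque, Optional, Tuple

Message = Tuple[str, bytes]


class OutboundBuffer:
    """Bounded store-and-forward queue for QoS 1 publishes (hypothetical helper)."""

    def __init__(self, max_buffered: int = 1000) -> None:
        self._max_buffered = max_buffered
        self._queue: Deque[Message] = deque()
        self._in_flight: "OrderedDict[int, Message]" = OrderedDict()
        self._next_id = 1

    def enqueue(self, topic: str, payload: bytes) -> None:
        """Buffer a message; when the buffer is full, drop the oldest one."""
        if len(self._queue) >= self._max_buffered:
            self._queue.popleft()
        self._queue.append((topic, payload))

    def next_to_send(self) -> Optional[Tuple[int, str, bytes]]:
        """Pop the oldest message and keep it in flight until acknowledged."""
        if not self._queue:
            return None
        topic, payload = self._queue.popleft()
        packet_id = self._next_id
        self._next_id += 1
        self._in_flight[packet_id] = (topic, payload)
        return packet_id, topic, payload

    def on_puback(self, packet_id: int) -> bool:
        """Mark a message as delivered; return False for unknown IDs."""
        return self._in_flight.pop(packet_id, None) is not None

    def on_connection_lost(self) -> None:
        """Move unacknowledged messages back to the front, preserving order."""
        for message in reversed(list(self._in_flight.values())):
            self._queue.appendleft(message)
        self._in_flight.clear()
```

Dropping the oldest message when full is a design choice for this sketch; depending on your data, dropping the newest or downsampling may be the better trade-off.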
Runtime resilience


Application Level

* Ensure your application process recovers and restarts.
* Use a process management tool.
* Ensure graceful exit handling.
* Application logs and metrics.
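
Graceful exit handling can be sketched as below, assuming the process manager stops the application with SIGTERM (SIGINT is handled as well). The `GracefulShutdown` class and the `do_work` and `flush` callables are hypothetical names for illustration; `flush` stands for whatever your application must finish before exiting, such as persisting buffered messages.

```python
import signal
import threading
from typing import Callable


class GracefulShutdown:
    """Turn termination signals into a flag the main loop can poll."""

    def __init__(self) -> None:
        self.stop_requested = threading.Event()

    def install(self) -> None:
        """Register handlers; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame) -> None:
        # Only set a flag here; the main loop does the actual cleanup.
        self.stop_requested.set()


def run(shutdown: GracefulShutdown, do_work: Callable[[], None],
        flush: Callable[[], None]) -> int:
    """Run `do_work` until a stop is requested, then always call `flush`."""
    iterations = 0
    try:
        while not shutdown.stop_requested.is_set():
            do_work()
            iterations += 1
    finally:
        flush()
    return iterations
```

Because `flush` sits in a `finally` block, it also runs when `do_work` raises, so buffered data is not silently lost on an unexpected error.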
Abstractions and Data Contracts


Application Level

* Define your application domain in code: events, metrics, activity, etc.
* Enforce data contracts. No breaking changes.
* Establish and enforce SLAs.
* Send machine raw data only if it's what your consumers need.
* But be ready to make raw data available during early discovery phases.
* Validate.
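
Defining the domain in code and validating against a data contract might look like this sketch. The `DoorEvent` type, its field names, and the version number are hypothetical examples for the elevator scenario, not a schema from this session. Unknown extra fields are ignored on purpose, so producers can add fields without breaking consumers.

```python
from dataclasses import dataclass

SCHEMA_VERSION = 1
ALLOWED_DOOR_STATES = {"open", "closed"}
REQUIRED_FIELDS = {
    "schema_version": int,
    "elevator_id": str,
    "door_state": str,
    "timestamp_ms": int,
}


class ContractViolation(ValueError):
    """Raised when a message does not satisfy the data contract."""


@dataclass(frozen=True)
class DoorEvent:
    schema_version: int
    elevator_id: str
    door_state: str
    timestamp_ms: int


def parse_door_event(raw: dict) -> DoorEvent:
    """Validate a decoded message and convert it into a domain event."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in raw:
            raise ContractViolation(f"missing field: {name}")
        value = raw[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ContractViolation(f"wrong type for field: {name}")
    if raw["schema_version"] != SCHEMA_VERSION:
        raise ContractViolation("unsupported schema_version")
    if raw["door_state"] not in ALLOWED_DOOR_STATES:
        raise ContractViolation("unknown door_state")
    if raw["timestamp_ms"] <= 0:
        raise ContractViolation("timestamp_ms must be positive")
    return DoorEvent(**{name: raw[name] for name in REQUIRED_FIELDS})
```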


End-to-End Data Integrity

Application Level

* Craft what integrity means for your application.
* Ensure eventual data consistency:
  * Checksums,
  * Timestamps in messages,
  * Fill in data gaps.
* Decouple data integrity checks from your ingestion.
* Log, monitor and alert on data integrity issues.
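
The checksum and timestamp ideas can be sketched as follows. The message fields and the choice of SHA-256 over a canonical JSON encoding are assumptions for illustration; an unkeyed checksum like this detects accidental corruption, not deliberate tampering. `find_gaps` assumes devices report at a known fixed interval.

```python
import hashlib
import json
from typing import Iterable, List, Tuple


def with_checksum(message: dict) -> dict:
    """Return a copy of the message with a SHA-256 checksum of its content."""
    body = json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {**message, "checksum": hashlib.sha256(body).hexdigest()}


def verify_checksum(message: dict) -> bool:
    """Recompute the checksum on the receiving side and compare."""
    received = dict(message)
    claimed = received.pop("checksum", None)
    if claimed is None:
        return False
    return with_checksum(received)["checksum"] == claimed


def find_gaps(timestamps_ms: Iterable[int], interval_ms: int,
              tolerance_ms: int = 0) -> List[Tuple[int, int]]:
    """Return (earlier, later) pairs where readings are further apart than expected."""
    ordered = sorted(timestamps_ms)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > interval_ms + tolerance_ms:
            gaps.append((earlier, later))
    return gaps
```

Running `find_gaps` in a separate job over stored data, rather than inside the ingestion path, is one way to keep integrity checks decoupled from ingestion.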


Mitigate all points of failure in your application

Application Level

* Pick the highest abstraction you can accommodate:
  * Or you are signing up for increased ownership (maintenance).
* Answer the question: What happens if it fails?
* Manage your points of failure with fallbacks.
* Retry with backoff strategy.
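
One way to answer "what happens if it fails?" in code is to pair every retried call with an explicit fallback. In this sketch, `primary` and `fallback` are hypothetical callables, for example a write to a search index and a write to local or object storage for later re-ingestion.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[Optional[Exception]], T],
                       attempts: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary` with exponential backoff, then hand the last error to `fallback`."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return primary()
        except Exception as error:  # narrow this to retryable errors in real code
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback(last_error)
```

The fallback receives the last error so it can be logged together with the data that was diverted.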
If you can see it, you can fix it! - Observability


Application Level

* Standardized and centralized application and service logging.
* Metrics.
* Tracing.
* Compile, analyse, set thresholds, alert/notify.
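
Standardized logging is easier to centralize when every component emits the same machine-readable shape. The sketch below uses Python's standard `logging` module with a JSON formatter; the `context` attribute and the field names are hypothetical conventions, not a required format.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied via logging's `extra`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.warning("publish failed", extra={"context": {"device_id": "elevator-42"}})` then produces a line that a log aggregator can filter by `device_id`.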
Ready for Scale?


Application Level (Part 1)

* Use managed services when you can.
* Understand load, service SLAs and if you need to increase service limits.
* Devices should un-align their reporting intervals.

Application Level (Part 2)

* Build behaviour for scale in your application:
  - Scalable reception of data, at different reporting frequencies and volumes,
  - Scalable routing,
  - Fan out,
  - Decouple,
  - Retries/backoff,
  - Error Handling,
  - Observability.
* Your application can identify what went wrong, and recover.
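
"Un-aligning" reporting intervals means devices with the same interval should not all report at the same instant. One possible approach, sketched here under the assumption that each device has a stable ID, derives a fixed per-device offset from a hash of that ID. The function names are hypothetical.

```python
import hashlib


def reporting_offset_seconds(device_id: str, interval_seconds: int) -> int:
    """Map a device ID to a stable offset in [0, interval_seconds)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


def next_report_time(device_id: str, interval_seconds: int, now_seconds: int) -> int:
    """Return the first slot strictly after `now_seconds` for this device."""
    offset = reporting_offset_seconds(device_id, interval_seconds)
    cycles_elapsed = (now_seconds - offset) // interval_seconds + 1
    return offset + cycles_elapsed * interval_seconds
```

Because the offset is derived from the ID rather than drawn at random on each boot, a device keeps the same slot across restarts, which keeps the fleet's load spread predictable.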
What does this look like end-to-end?


[Diagram components: MQTT Client (app. process), AWS IoT Core, IoT data topic(s), IoT rule(s) with Error Action, Amazon Kinesis Data Streams, pre-processing AWS Lambda, Amazon Kinesis Data Firehose with batch writes, Amazon OpenSearch, Amazon S3 for failure and/or backup, other services.]

* Failed files from S3 should be re-ingested via a decoupled flow.
* Other services can consume the raw data.
* What about observability? Amazon CloudWatch, AWS X-Ray, AWS IoT Device Defender, Fleet Hub.
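
To illustrate an IoT rule with an Error Action, the sketch below builds a rule definition as a plain Python dictionary: matching messages go to a Kinesis data stream, and messages whose action fails are written to S3. The topic filter, stream name, bucket name, and role ARN are hypothetical placeholders. The dictionary follows the topic rule payload shape as I understand it from the AWS IoT API (for instance, what boto3's `create_topic_rule` takes as `topicRulePayload`); verify field names against the current documentation before use.

```python
def build_topic_rule_payload(topic_filter: str, stream_name: str,
                             error_bucket: str, role_arn: str) -> dict:
    """Build a rule that forwards to Kinesis and stores failed messages in S3."""
    return {
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "kinesis": {
                    "roleArn": role_arn,
                    "streamName": stream_name,
                    # Substitution template evaluated by the rules engine.
                    "partitionKey": "${newuuid()}",
                }
            }
        ],
        # Invoked when an action above fails, so the message is not lost.
        "errorAction": {
            "s3": {
                "roleArn": role_arn,
                "bucketName": error_bucket,
                "key": "errors/${topic()}/${timestamp()}",
            }
        },
    }


payload = build_topic_rule_payload(
    topic_filter="elevators/+/telemetry",                   # hypothetical topic
    stream_name="elevator-telemetry",                       # hypothetical stream
    error_bucket="elevator-ingest-errors",                  # hypothetical bucket
    role_arn="arn:aws:iam::123456789012:role/iot-rule",     # placeholder ARN
)
```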


Maturity Model for Resilient IoT Applications

* Spike/Proof of concept: 1 device
* 100s of devices
* Mitigate and be sure it works
* Ready for Large Scale: millions of devices
MATURITY MODEL FOR RESILIENT IOT APPLICATIONS


Columns: Phases/Steps | Edge Application (one instance) | Cloud Application (handles many Edge application instances)

1. Spike/Proof of concept
   Edge Application:
   * Can connect via MQTT. Default client library MQTT configuration can be used.
   * Can publish/subscribe via MQTT.
   Cloud Application:
   * Receives data from an edge application instance.
   * Sends that data downstream.

2. Hope it works
   Edge Application:
   * Use MQTT application protocol features for application resilience.
   * What can you just turn on easily? QoS 1, LWT, persistent sessions, retained messages, connection retries with sensible backoff strategies.
   * Listen on and handle MQTT connection lifecycle events.
   Cloud Application:
   * Use the highest level of abstraction if possible: for example Cloud managed, serverless technologies, dynamic scaling based on load.

3. Know if it works
   * Build logging.
   * Log and handle message delivery failures.
   * Design for eventual data consistency, and log inconsistencies.
   * Use testing tools like the AWS IoT Device Advisor for testing your MQTT connection resilience and security.
   Edge Application:
   * Implement storage at edge.
   * Consciously decide how much data you need to store to handle offline-mode.
   Cloud Application:
   * Configure metrics, alarms, tracing.
   * Validate reliable and secure connectivity with AWS IoT Core using AWS IoT Core Device Advisor, or similar.

4. Mitigate and be sure it works
   Edge Application:
   * Identify and handle all points of failure.
   Cloud Application:
   * Identify and handle all points of failure (for example: if you use an IoT Rule, have an Error Action and write to storage, handle exceptions, tell managed services how to handle errors).
   * Ensure your domain data is consistent - sanity checks, checksum algorithms.

5. Ready for large scale
   * Handle millions of requests/messages.
   * All calls to external services are surrounded by retries, with reasonable backoff strategies.
   * Understand and design taking into account third party system SLAs.
   * Watch out for increasing service limits of Cloud services.
   * Decouple and fan out.
   * Set up monitoring tools: for example Amazon CloudWatch Logs, Metrics, Insights, AWS IoT Device Defender, Fleet Hub.
   * Alert the right teams on exceeded thresholds.
   * Is your application ready to continuously learn and cope?


Key Takeaways

* Accurate insights are not possible with unreliable data.
* Resilience is the mechanism to achieve reliability.
* External & internal factors can cause unreliability in IoT applications.
* Resilience must be built in and we can use a maturity model for resilience.
Spike/PoC -> Hope it works -> Know if it works -> Mitigate -> Ready for scale


Resources

* AWS IoT Core Device Advisor:
  https://docs.aws.amazon.com/iot/latest/developerguide/device-advisor.htm
* AWS IoT Device SDK:
  https://github.com/aws/aws-iot-device-sdk-js-v

More Content

* AWS IoT Greengrass:
  https://github.com/aws-greengrass
* Blogs/Posts: https://dev.to/iotbuilders
* IoT Dev YouTube:
  https://youtube.com/@iotbuilders
Thank you!


Alina Dima
Senior Developer Advocate
AWS IoT


Connect with me 


